ACL.2023 - Industry Track | Cool Papers - Immersive Paper Discovery

#1 CWSeg: An Efficient and General Approach to Chinese Word Segmentation [PDF³] [Copy] [Kimi⁴]

Authors: Dedong Li ; Rui Zhao ; Fei Tan

In this work, we report our efforts in advancing Chinese Word Segmentation for the purpose of rapid deployment in different applications. The pre-trained language model (PLM) based segmentation methods have achieved state-of-the-art (SOTA) performance, whereas this paradigm also poses challenges in the deployment. It includes the balance between performance and cost, segmentation ambiguity due to domain diversity and vague words boundary, and multi-grained segmentation. In this context, we propose a simple yet effective approach, namely CWSeg, to augment PLM-based schemes by developing cohort training and versatile decoding strategies. Extensive experiments on benchmark datasets demonstrate the efficiency and generalization of our approach. The corresponding segmentation system is also implemented for practical usage and the demo is recorded.

#2 “Knowledge is Power”: Constructing Knowledge Graph of Abdominal Organs and Using Them for Automatic Radiology Report Generation [PDF] [Copy] [Kimi⁴]

Authors: Kaveri Kale ; Pushpak Bhattacharyya ; Aditya Shetty ; Milind Gune ; Kush Shrivastava ; Rustom Lawyer ; Spriha Biswas

In conventional radiology practice, the radiologist dictates the diagnosis to the transcriptionist, who then prepares a preliminary formatted report referring to the notes, after which the radiologist reviews the report, corrects the errors, and signs off. This workflow is prone to delay and error. In this paper, we report our work on automatic radiology report generation from radiologists’ dictation, which is in collaboration with a startup about to become Unicorn. A major contribution of our work is the set of knowledge graphs (KGs) of ten abdominal organs- Liver, Kidney, Gallbladder, Uterus, Urinary bladder, Ovary, Pancreas, Prostate, Biliary Tree, and Bowel. Our method for constructing these KGs relies on extracting entity1-relation-entity2 triplets from a large collection (about 10,000) of free-text radiology reports. The quality and coverage of the KGs are verified by two experienced radiologists (practicing for the last 30 years and 8 years, respectively). The dictation of the radiologist is automatically converted to what is called a pathological description which is the clinical description of the findings of the radiologist during ultrasonography (USG). Our knowledge-enhanced deep learning model improves the reported BLEU-3, ROUGE-L, METEOR, and CIDEr scores of the pathological description generation by 2%, 4%, 2% and 2% respectively. To the best of our knowledge, this is the first attempt at representing the abdominal organs in the form of knowledge graphs and utilising these graphs for the automatic generation of USG reports. A Minimum Viable Product (MVP) has been made available to the beta users, i.e., radiologists of reputed hospitals, for testing and evaluation. Our solution guarantees report generation within 30 seconds of running a scan.

#3 Hunt for Buried Treasures: Extracting Unclaimed Embodiments from Patent Specifications [PDF] [Copy] [Kimi³]

Authors: Chikara Hashimoto ; Gautam Kumar ; Shuichiro Hashimoto ; Jun Suzuki

Patent applicants write patent specificationsthat describe embodiments of inventions. Some embodiments are claimed for a patent,while others may be unclaimeddue to strategic considerations. Unclaimed embodiments may be extracted byapplicants later and claimed incontinuing applications togain advantages over competitors. Despite being essential for corporate intellectual property (IP) strategies,unclaimed embodiment extraction is conducted manually,and little research has been conducted on its automation. This paper presents a novel task ofunclaimed embodiment extraction (UEE)and a novel dataset for the task. Our experiments with Transformer-based modelsdemonstratedthat the task was challenging as it requiredconducting natural language inference onpatent specifications, which consisted oftechnical, long, syntactically and semanticallyinvolved sentences. We release the dataset and code to foster this new area of research.

#4 MathPrompter: Mathematical Reasoning using Large Language Models [PDF] [Copy] [Kimi²]

Authors: Shima Imani ; Liang Du ; Harsh Shrivastava

Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks and often provide incorrect answers. Unlike natural language understanding, math problems typically have a single correct answer, making the task of generating accurate solutions more challenging for LLMs. To the best of our knowledge, we are not aware of any LLMs that indicate their level of confidence in their responses which fuels a trust deficit in these models impeding their adoption. To address this deficiency, we propose ‘MathPrompter’, a technique that improves performance of LLMs on arithmetic problems along with increased reliance in the predictions. MathPrompter uses the Zero-shot chain-of-thought prompting technique to generate multiple algebraic expressions or python functions to solve the same math problem in different ways and thereby raise the confidence level in the output results. This is in contrast to other prompt based CoT methods, where there is no check on the validity of the intermediate steps followed. Our technique improves over state-of-the-art on the ‘MultiArith’ dataset (78.7% - 92.5%) evaluated using 175B parameter GPT-based LLM.

#5 Constrained Policy Optimization for Controlled Self-Learning in Conversational AI Systems [PDF] [Copy] [Kimi¹]

Authors: Mohammad Kachuee ; Sungjin Lee

Recently, self-learning methods based on user satisfaction metrics and contextual bandits have shown promising results to enable consistent improvements in conversational AI systems. However, directly targeting such metrics by off-policy bandit learning objectives often increases the risk of making abrupt policy changes that break the current user experience. In this study, we introduce a scalable framework for supporting fine-grained exploration targets for individual domains via user-defined constraints. For example, we may want to ensure fewer policy deviations in business-critical domains such as shopping, while allocating more exploration budget to domains such as music. We present a novel meta-gradient learning approach that is scalable and practical to address this problem. The proposed method adjusts constraint violation penalty terms adaptively through a meta objective that encourages balanced constraint satisfaction across domains. We conducted extensive experiments on a real-world conversational AI and using a set of realistic constraint benchmarks. The proposed approach has been deployed in production for a large-scale commercial assistant, enabling the best balance between the policy value and constraint satisfaction rate.

#6 pNLP-Mixer: an Efficient all-MLP Architecture for Language [PDF] [Copy] [Kimi²]

Authors: Francesco Fusco ; Damian Pascual ; Peter Staar ; Diego Antognini

Large pre-trained language models based on transformer architectureƒhave drastically changed the natural language processing (NLP) landscape. However, deploying those models for on-device applications in constrained devices such as smart watches is completely impractical due to their size and inference cost. As an alternative to transformer-based architectures, recent work on efficient NLP has shown that weight-efficient models can attain competitive performance for simple tasks, such as slot filling and intent classification, with model sizes in the order of the megabyte. This work introduces the pNLP-Mixer architecture, an embedding-free MLP-Mixer model for on-device NLP that achieves high weight-efficiency thanks to a novel projection layer. We evaluate a pNLP-Mixer model of only one megabyte in size on two multi-lingual semantic parsing datasets, MTOP and multiATIS. Our quantized model achieves 99.4% and 97.8% the performance of mBERT on MTOP and multiATIS, while using 170x less parameters. Our model consistently beats the state-of-the-art of tiny models (pQRNN), which is twice as large, by a margin up to 7.8% on MTOP.

#7 Extracting Text Representations for Terms and Phrases in Technical Domains [PDF²] [Copy] [Kimi³]

Authors: Francesco Fusco ; Diego Antognini

Extracting dense representations for terms and phrases is a task of great importance for knowledge discovery platforms targeting highly-technical fields. Dense representations are used as features for downstream components and have multiple applications ranging from ranking results in search to summarization. Common approaches to create dense representations include training domain-specific embeddings with self-supervised setups or using sentence encoder models trained over similarity tasks. In contrast to static embeddings, sentence encoders do not suffer from the out-of-vocabulary (OOV) problem, but impose significant computational costs. In this paper, we propose a fully unsupervised approach to text encoding that consists of training small character-based models with the objective of reconstructing large pre-trained embedding matrices. Models trained with this approach can not only match the quality of sentence encoders in technical domains, but are 5 times smaller and up to 10 times faster, even on high-end GPUs.

#8 CocaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval [PDF] [Copy] [Kimi¹]

Authors: Jiapeng Wang ; Chengyu Wang ; Xiaodan Wang ; Jun Huang ; Lianwen Jin

Large-scale pre-trained text-image models with dual-encoder architectures (such as CLIP) are typically adopted for various vision-language applications, including text-image retrieval. However, these models are still less practical on edge devices or for real-time situations, due to the substantial indexing and inference time and the large consumption of computational resources. Although knowledge distillation techniques have been widely utilized for uni-modal model compression, how to expand them to the situation when the numbers of modalities and teachers/students are doubled has been rarely studied. In this paper, we conduct comprehensive experiments on this topic and propose the fully-Connected knowledge interaction graph (Coca) technique for cross-modal pre-training distillation. Based on our findings, the resulting CocaCLIP achieves SOTA performances on the widely-used Flickr30K and MSCOCO benchmarks under the lightweight setting. An industry application of our method on an e-commercial platform further demonstrates the significant effectiveness of CocaCLIP.

#9 KG-FLIP: Knowledge-guided Fashion-domain Language-Image Pre-training for E-commerce [PDF] [Copy] [Kimi¹]

Authors: Qinjin Jia ; Yang Liu ; Daoping Wu ; Shaoyuan Xu ; Huidong Liu ; Jinmiao Fu ; Roland Vollgraf ; Bryan Wang

Various Vision-Language Pre-training (VLP) models (e.g., CLIP, BLIP) have sprung up and dramatically advanced the benchmarks for public general-domain datasets (e.g., COCO, Flickr30k). Such models usually learn the cross-modal alignment from large-scale well-aligned image-text datasets without leveraging external knowledge. Adapting these models to downstream applications in specific domains like fashion requires fine-grained in-domain image-text corpus, which are usually less semantically aligned and in small scale that requires efficient pre-training strategies. In this paper, we propose a knowledge-guided fashion-domain language-image pre-training (FLIP) framework that focuses on learning fine-grained representations in e-commerce domain and utilizes external knowledge (i.e., product attribute schema), to improve the pre-training efficiency. Experiments demonstrate that FLIP outperforms previous state-of-the-art VLP models on Amazon data and on the Fashion-Gen dataset by large margins. FLIP has been successfully deployed in the Amazon catalog system to backfill missing attributes and improve the customer shopping experience.

#10 Domain-specific transformer models for query translation [PDF] [Copy] [Kimi¹]

Authors: Mandar Kulkarni ; Nikesh Garera ; Anusua Trivedi

Due to the democratization of e-commerce, many product companies are listing their goods for online shopping. For periodic buying within a domain such as Grocery, consumers are generally inclined to buy certain brands of products. Due to a large non-English speaking population in India, we observe a significant percentage of code-mix Hinglish search queries e.g., sasta atta. An intuitive approach to dealing with code-mix queries is to train an encoder-decoder model to translate the query to English to perform the search. However, the problem becomes non-trivial when the brand names themselves have Hinglish names and possibly have a literal English translation. In such queries, only the context (non-brand name) Hinglish words needs to be translated. In this paper, we propose a simple yet effective modification to the transformer training to preserve/correct Grocery brand names in the output while selectively translating the context words. To achieve this, we use an additional dataset of popular Grocery brand names. Brand names are added as tokens to the model vocabulary, and the token embeddings are randomly initialized. Further, we introduce a Brand loss in training the translation model. Brand loss is a cross entropy loss computed using a denoising auto-encoder objective with brand name data. We warm-start the training from a public pre-trained checkpoint (such as BART/T5) and further adapt it for query translation using the domain data. The proposed model is generic and can be used with English as well as code-mix Hinglish queries alleviating the need for language detection. To reduce the latency of the model for the production deployment, we use knowledge distillation and quantization. Experimental evaluation indicates that the proposed approach improves translation results by preserving/correcting English/Hinglish brand names. After positive results with A/B testing, the model is currently deployed in production.

#11 Label efficient semi-supervised conversational intent classification [PDF] [Copy] [Kimi¹]

Authors: Mandar Kulkarni ; Kyung Kim ; Nikesh Garera ; Anusua Trivedi

To provide a convenient shopping experience and to answer user queries at scale, conversational platforms are essential for e-commerce. The user queries can be pre-purchase questions, such as product specifications and delivery time related, or post-purchase queries, such as exchange and return. A chatbot should be able to understand and answer a variety of such queries to help users with relevant information. One of the important modules in the chatbot is automated intent identification, i.e., understanding the user’s intention from the query text. Due to non-English speaking users interacting with the chatbot, we often get a significant percentage of code mix queries and queries with grammatical errors, which makes the problem more challenging. This paper proposes a simple yet competent Semi-Supervised Learning (SSL) approach for label-efficient intent classification. We use a small labeled corpus and relatively larger unlabeled query data to train a transformer model. For training the model with labeled data, we explore supervised MixUp data augmentation. To train with unlabeled data, we explore label consistency with dropout noise. We experiment with different pre-trained transformer architectures, such as BERT and sentence-BERT. Experimental results demonstrate that the proposed approach significantly improves over the supervised baseline, even with a limited labeled set. A variant of the model is currently deployed in production.

#12 xPQA: Cross-Lingual Product Question Answering in 12 Languages [PDF] [Copy] [Kimi¹]

Authors: Xiaoyu Shen ; Akari Asai ; Bill Byrne ; Adria De Gispert

Product Question Answering (PQA) systems are key in e-commerce applications as they provide responses to customers’ questions as they shop for products. While existing work on PQA focuses mainly on English, in practice there is need to support multiple customer languages while leveraging product information available in English. To study this practical industrial task, we present xPQA, a large-scale annotated cross-lingual PQA dataset in 12 languages, and report results in (1) candidate ranking, to select the best English candidate containing the information to answer a non-English question; and (2) answer generation, to generate a natural-sounding non-English answer based on the selected English candidate. We evaluate various approaches involving machine translation at runtime or offline, leveraging multilingual pre-trained LMs, and including or excluding xPQA training data. We find that in-domain data is essential as cross-lingual rankers trained on other domains perform poorly on the PQA task, and that translation-based approaches are most effective for candidate ranking while multilingual finetuning works best for answer generation. Still, there remains a significant performance gap between the English and the cross-lingual test sets.

#13 Learn over Past, Evolve for Future: Forecasting Temporal Trends for Fake News Detection [PDF] [Copy] [Kimi¹]

Authors: Beizhe Hu ; Qiang Sheng ; Juan Cao ; Yongchun Zhu ; Danding Wang ; Zhengjia Wang ; Zhiwei Jin

Fake news detection has been a critical task for maintaining the health of the online news ecosystem. However, very few existing works consider the temporal shift issue caused by the rapidly-evolving nature of news data in practice, resulting in significant performance degradation when training on past data and testing on future data. In this paper, we observe that the appearances of news events on the same topic may display discernible patterns over time, and posit that such patterns can assist in selecting training instances that could make the model adapt better to future data. Specifically, we design an effective framework FTT (Forecasting Temporal Trends), which could forecast the temporal distribution patterns of news data and then guide the detector to fast adapt to future distribution. Experiments on the real-world temporally split dataset demonstrate the superiority of our proposed framework.

#14 AVEN-GR: Attribute Value Extraction and Normalization using product GRaphs [PDF] [Copy] [Kimi²]

Authors: Thomas Ricatte ; Donato Crisostomi

Getting a good understanding of the user intent is vital for e-commerce applications to surface the right product to a given customer query. Query Understanding (QU) systems are essential for this purpose, and many e-commerce providers are working on complex solutions that need to be data efficient and able to capture early emerging market trends. Query Attribute Understanding (QAU) is a sub-component of QU that involves extracting named attributes from user queries and linking them to existing e-commerce entities such as brand, material, color, etc. While extracting named entities from text has been extensively explored in the literature, QAU requires specific attention due to the nature of the queries, which are often short, noisy, ambiguous, and constantly evolving. This paper makes three contributions to QAU. First, we propose a novel end-to-end approach that jointly solves Named Entity Recognition (NER) and Entity Linking (NEL) and enables open-world reasoning for QAU. Second, we introduce a novel method for utilizing product graphs to enhance the representation of query entities. Finally, we present a new dataset constructed from public sources that can be used to evaluate the performance of future QAU systems.

#15 GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model [PDF] [Copy] [Kimi²]

Authors: Shicheng Tan ; Weng Lam Tam ; Yuanchun Wang ; Wenwen Gong ; Shu Zhao ; Peng Zhang ; Jie Tang

Currently, the reduction in the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their widespread deployment on various devices. However, the deployment of knowledge distillation systems faces great challenges in real-world industrial-strength applications, which require the use of complex distillation methods on even larger-scale PLMs (over 10B), limited by memory on GPUs and the switching of methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation on larger-scale PLMs using various distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch and combine different distillation methods within a single framework. Experimental results show that GKD can support the distillation of at least 100B-scale PLMs and 25 mainstream methods on 8 NVIDIA A100 (40GB) GPUs.

#16 FashionKLIP: Enhancing E-Commerce Image-Text Retrieval with Fashion Multi-Modal Conceptual Knowledge Graph [PDF] [Copy] [Kimi¹]

Authors: Xiaodan Wang ; Chengyu Wang ; Lei Li ; Zhixu Li ; Ben Chen ; Linbo Jin ; Jun Huang ; Yanghua Xiao ; Ming Gao

Image-text retrieval is a core task in the multi-modal domain, which arises a lot of attention from both research and industry communities. Recently, the booming of visual-language pre-trained (VLP) models has greatly enhanced the performance of cross-modal retrieval. However, the fine-grained interactions between objects from different modalities are far from well-established. This issue becomes more severe in the e-commerce domain, which lacks sufficient training data and fine-grained cross-modal knowledge. To alleviate the problem, this paper proposes a novel e-commerce knowledge-enhanced VLP model FashionKLIP. We first automatically establish a multi-modal conceptual knowledge graph from large-scale e-commerce image-text data, and then inject the prior knowledge into the VLP model to align across modalities at the conceptual level. The experiments conducted on a public benchmark dataset demonstrate that FashionKLIP effectively enhances the performance of e-commerce image-text retrieval upon state-of-the-art VLP models by a large margin. The application of the method in real industrial scenarios also proves the feasibility and efficiency of FashionKLIP.

#17 Entity Contrastive Learning in a Large-Scale Virtual Assistant System [PDF] [Copy] [Kimi²]

Authors: Jonathan Rubin ; Jason Crowley ; George Leung ; Morteza Ziyadi ; Maria Minakova

Conversational agents are typically made up of domain (DC) and intent classifiers (IC) that identify the general subject an utterance belongs to and the specific action a user wishes to achieve. In addition, named entity recognition (NER) performs per token labeling to identify specific entities of interest in a spoken utterance. We investigate improving joint IC and NER models using entity contrastive learning that attempts to cluster similar entities together in a learned representation space. We compare a full virtual assistant system trained using entity contrastive learning to a production baseline system that does not use contrastive learning. We present both offline results, using retrospective test sets, as well as live online results from an A/B test that compared the two systems. In both the offline and online settings, entity contrastive training improved overall performance against production baselines. Furthermore, we provide a detailed analysis of learned entity embeddings, including both qualitative analysis via dimensionality-reduced visualizations and quantitative analysis by computing alignment and uniformity metrics. We show that entity contrastive learning improves alignment metrics and produces well-formed embedding clusters in representation space.

#18 Tab-Cleaner: Weakly Supervised Tabular Data Cleaning via Pre-training for E-commerce Catalog [PDF] [Copy] [Kimi¹]

Authors: Kewei Cheng ; Xian Li ; Zhengyang Wang ; Chenwei Zhang ; Binxuan Huang ; Yifan Ethan Xu ; Xin Luna Dong ; Yizhou Sun

Product catalogs, conceptually in the form of text-rich tables, are self-reported by individual retailers and thus inevitably contain noisy facts. Verifying such textual attributes in product catalogs is essential to improve their reliability. However, popular methods for processing free-text content, such as pre-trained language models, are not particularly effective on structured tabular data since they are typically trained on free-form natural language texts. In this paper, we present Tab-Cleaner, a model designed to handle error detection over text-rich tabular data following a pre-training / fine-tuning paradigm. We train Tab-Cleaner on a real-world Amazon Product Catalog table w.r.t millions of products and show improvements over state-of-the-art methods by 16\% on PR AUC over attribute applicability classification task and by 11\% on PR AUC over attribute value validation task.

#19 Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs [PDF] [Copy] [Kimi²]

Authors: Abishek Komma ; Nagesh Panyam Chandrasekarasastry ; Timothy Leffel ; Anuj Goyal ; Angeliki Metallinou ; Spyros Matsoukas ; Aram Galstyan

Measurement of interaction quality is a critical task for the improvement of large-scale spoken dialog systems. Existing approaches to dialog quality estimation either focus on evaluating the quality of individual turns, or collect dialog-level quality measurements from end users immediately following an interaction. In contrast to these approaches, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA). DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment. In this contribution, we show that: (i) while dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality; (ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and (iii) the proposed evaluation model shows better domain generalization ability compared to the baselines. On the basis of these results, we argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.

#20 Tab-CQA: A Tabular Conversational Question Answering Dataset on Financial Reports [PDF] [Copy] [Kimi¹]

Authors: Chuang Liu ; Junzhuo Li ; Deyi Xiong

Existing conversational question answering (CQA) datasets have been usually constructed from unstructured texts in English. In this paper, we propose Tab-CQA, a tabular CQA dataset created from Chinese financial reports that are extracted from listed companies in a wide range of different sectors in the past 30 years. From these reports, we select 2,463 tables, and manually generate 2,463 conversations with 35,494 QA pairs. Additionally, we select 4,578 tables, from which 4,578 conversations with 73,595 QA pairs are automatically created via a template-based method. With the manually- and automatically-generated conversations, Tab-CQA contains answerable and unanswerable questions. For the answerable questions, we further diversify them to cover a wide range of skills, e.g., table retrieval, fact checking, numerical reasoning, so as to accommodate real-world scenarios. We further propose two different tabular CQA models, a text-based model and an operation-based model, and evaluate them on Tab-CQA. Experiment results show that Tab-CQA is a very challenging dataset, where a huge performance gap exists between human and neural models. We will publicly release Tab-CQA as a benchmark testbed to promote further research on Chinese tabular CQA.

#21 KoSBI: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Applications [PDF] [Copy] [Kimi¹]

Authors: Hwaran Lee ; Seokhee Hong ; Joonsuk Park ; Takyoung Kim ; Gunhee Kim ; Jung-woo Ha

Large language models (LLMs) not only learn natural text generation abilities but also social biases against different demographic groups from real-world data. This poses a critical risk when deploying LLM-based applications. Existing research and resources are not readily applicable in South Korea due to the differences in language and culture, both of which significantly affect the biases and targeted demographic groups. This limitation requires localized social bias datasets to ensure the safe and effective deployment of LLMs. To this end, we present KosBi, a new social bias dataset of 34k pairs of contexts and sentences in Korean covering 72 demographic groups in 15 categories. We find that through filtering-based moderation, social biases in generated content can be reduced by 16.47%p on average for HyperClova (30B and 82B), and GPT-3.

#22 Improving Knowledge Production Efficiency With Question Answering on Conversation [PDF] [Copy] [Kimi²]

Authors: Changlin Yang ; Siye Liu ; Sen Hu ; Wangshu Zhang ; Teng Xu ; Jing Zheng

Through an online customer service application, we have collected many conversations between customer service agents and customers. Building a knowledge production system can help reduce the labor cost of maintaining the FAQ database for the customer service chatbot, whose core module is question answering (QA) on these conversations. However, most existing researches focus on document-based QA tasks, and there is a lack of researches on conversation-based QA and related datasets, especially in Chinese language. The challenges of conversation-based QA include: 1) answers may be scattered among multiple dialogue turns; 2) understanding complex dialogue contexts is more complicated than documents. To address these challenges, we propose a multi-span extraction model on this task and introduce continual pre-training and multi-task learning schemes to further improve model performance. To validate our approach, we construct two Chinese datasets using dialogues as the knowledge source, namely cs-qaconv and kd-qaconv, respectively. Experimental results demonstrate that the proposed model outperforms the baseline on both datasets. The online application also verifies the effectiveness of our method. The dataset kd-qaconv will be released publicly for research purposes.

#23 Mitigating the Burden of Redundant Datasets via Batch-Wise Unique Samples and Frequency-Aware Losses [PDF] [Copy] [Kimi¹]

Authors: Donato Crisostomi ; Andrea Caciolai ; Alessandro Pedrani ; Kay Rottmann ; Alessandro Manzotti ; Enrico Palumbo ; Davide Bernardi

Datasets used to train deep learning models in industrial settings often exhibit skewed distributions with some samples repeated a large number of times. This paper presents a simple yet effective solution to reduce the increased burden of repeated computation on redundant datasets. Our approach eliminates duplicates at the batch level, without altering the data distribution observed by the model, making it model-agnostic and easy to implement as a plug-and-play module. We also provide a mathematical expression to estimate the reduction in training time that our approach provides. Through empirical evidence, we show that our approach significantly reduces training times on various models across datasets with varying redundancy factors, without impacting their performance on the Named Entity Recognition task, both on publicly available datasets and in real industrial settings. In the latter, the approach speeds training by up to 87%, and by 46% on average, with a drop in model performance of 0.2% relative at worst. We finally release a modular and reusable codebase to further advance research in this area.

#24 The economic trade-offs of large language models: A case study [PDF] [Copy] [Kimi¹]

Authors: Kristen Howell ; Gwen Christian ; Pavel Fomitchov ; Gitit Kehat ; Julianne Marzulla ; Leanne Rolston ; Jadin Tredup ; Ilana Zimmerman ; Ethan Selfridge ; Joseph Bradley

Contacting customer service via chat is a common practice. Because employing customer service agents is expensive, many companies are turning to NLP that assists human agents by auto-generating responses that can be used directly or with modifications. With their ability to handle large context windows, Large Language Models (LLMs) are a natural fit for this use case. However, their efficacy must be balanced with the cost of training and serving them. This paper assesses the practical cost and impact of LLMs for the enterprise as a function of the usefulness of the responses that they generate. We present a cost framework for evaluating an NLP model’s utility for this use case and apply it to a single brand as a case study in the context of an existing agent assistance product. We compare three strategies for specializing an LLM — prompt engineering, fine-tuning, and knowledge distillation — using feedback from the brand’s customer service agents. We find that the usability of a model’s responses can make up for a large difference in inference cost for our case study brand, and we extrapolate our findings to the broader enterprise space.

#25 Application-Agnostic Language Modeling for On-Device ASR [PDF] [Copy] [Kimi¹]

Authors: Markus Nussbaum-thom ; Lyan Verwimp ; Youssef Oualil

On-device automatic speech recognition systems face several challenges compared to server-based systems. They have to meet stricter constraints in terms of speed, disk size and memory while maintaining the same accuracy. Often they have to serve several ap- plications with different distributions at once, such as communicating with a virtual assistant and speech-to-text. The simplest solution to serve multiple applications is to build application-specific (language) models, but this leads to an increase in memory. Therefore, we explore different data- and architecture-driven language modeling approaches to build a single application-agnostic model. We propose two novel feed-forward architectures that find an optimal trade off between different on-device constraints. In comparison to the application-specific solution, one of our novel approaches reduces the disk size by half, while maintaining speed and accuracy of the original model.